This document provides a comprehensive, step-by-step guide to building a binary classification model using the tidymodels ecosystem in R. We will tackle the classic Titanic dataset, a rich collection of passenger data, to predict survival. This project will walk you through the entire machine learning pipeline, from initial data loading and exploratory data analysis (EDA) to sophisticated model training, hyperparameter tuning, and final model evaluation and deployment.
The primary goal is to demonstrate a robust and reproducible workflow for classification tasks. We will explore various algorithms, compare their performance, and select the best model for making predictions on unseen data.
2 Installing and Loading Libraries
2.1 Step 1: Install Necessary Libraries
First, we ensure that all the required R packages are installed. We use the pak package for efficient and reliable package management.
Code
# This chunk installs all necessary packages.
# It's set to not evaluate by default to avoid re-installing every time.
if (!requireNamespace("pak", quietly = TRUE)) {
  install.packages("pak")
}

# pak::pak handles dependencies gracefully.
pak::pak(c("tidymodels", "tidyverse", "titanic", "here", "naniar", "kableExtra",
           "ranger", "xgboost", "kknn", "kernlab", "doParallel", "discrim",
           "glmnet", "future.apply", "reactable", "skimr", "themis"))
2.2 Step 2: Load Libraries
Next, we load the libraries that will be used throughout the analysis. Each library serves a specific purpose, from data manipulation (tidyverse) to modeling (tidymodels) and creating interactive tables (reactable).
Code
library(tidymodels)  # Core package for modeling
library(tidyverse)   # Data manipulation and visualization
library(titanic)     # Contains the Titanic dataset
library(here)        # For reproducible file paths
library(naniar)      # For visualizing missing data
library(kableExtra)  # For creating beautiful tables
library(discrim)     # For discriminant analysis models
library(glmnet)      # For regularized regression models
library(reactable)   # For interactive tables
library(future)      # For parallel processing
library(skimr)       # For summary statistics
library(themis)      # For dealing with imbalanced data
3 Data Preparation
3.1 Step 3: Read Data
We load the Titanic dataset, which is stored in a CSV file. The here package helps in locating the file relative to the project’s root directory, making the code more portable.
Code
titanic_df <- read_csv(here("../data", "titanic_data.csv"))

# Glimpse the data to see its structure and variable types
glimpse(titanic_df)
3.2 Step 4: Exploratory Data Analysis (EDA)
EDA is a crucial step to understand the data’s underlying structure, identify missing values, and uncover relationships between variables.
Code
# Get a quick summary of the dataset
summary(titanic_df)
PassengerId Survived Pclass Name
Min. : 1.0 Min. :0.0000 Min. :1.000 Length:891
1st Qu.:223.5 1st Qu.:0.0000 1st Qu.:2.000 Class :character
Median :446.0 Median :0.0000 Median :3.000 Mode :character
Mean :446.0 Mean :0.3838 Mean :2.309
3rd Qu.:668.5 3rd Qu.:1.0000 3rd Qu.:3.000
Max. :891.0 Max. :1.0000 Max. :3.000
Sex Age SibSp Parch
Length:891 Min. : 0.42 Min. :0.000 Min. :0.0000
Class :character 1st Qu.:20.12 1st Qu.:0.000 1st Qu.:0.0000
Mode :character Median :28.00 Median :0.000 Median :0.0000
Mean :29.70 Mean :0.523 Mean :0.3816
3rd Qu.:38.00 3rd Qu.:1.000 3rd Qu.:0.0000
Max. :80.00 Max. :8.000 Max. :6.0000
NA's :177
Ticket Fare Cabin Embarked
Length:891 Min. : 0.00 Length:891 Length:891
Class :character 1st Qu.: 7.91 Class :character Class :character
Mode :character Median : 14.45 Mode :character Mode :character
Mean : 32.20
3rd Qu.: 31.00
Max. :512.33
Code
# Use skimr for more detailed and user-friendly summary statistics
skim(titanic_df)
Data summary

| | |
|---|---|
| Name | titanic_df |
| Number of rows | 891 |
| Number of columns | 12 |
| Column type frequency: character | 5 |
| Column type frequency: numeric | 7 |
| Group variables | None |

Variable type: character

| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| Name | 0 | 1.00 | 12 | 82 | 0 | 891 | 0 |
| Sex | 0 | 1.00 | 4 | 6 | 0 | 2 | 0 |
| Ticket | 0 | 1.00 | 3 | 18 | 0 | 681 | 0 |
| Cabin | 687 | 0.23 | 1 | 15 | 0 | 147 | 0 |
| Embarked | 2 | 1.00 | 1 | 1 | 0 | 3 | 0 |

Variable type: numeric

| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| PassengerId | 0 | 1.0 | 446.00 | 257.35 | 1.00 | 223.50 | 446.00 | 668.5 | 891.00 | ▇▇▇▇▇ |
| Survived | 0 | 1.0 | 0.38 | 0.49 | 0.00 | 0.00 | 0.00 | 1.0 | 1.00 | ▇▁▁▁▅ |
| Pclass | 0 | 1.0 | 2.31 | 0.84 | 1.00 | 2.00 | 3.00 | 3.0 | 3.00 | ▃▁▃▁▇ |
| Age | 177 | 0.8 | 29.70 | 14.53 | 0.42 | 20.12 | 28.00 | 38.0 | 80.00 | ▂▇▅▂▁ |
| SibSp | 0 | 1.0 | 0.52 | 1.10 | 0.00 | 0.00 | 0.00 | 1.0 | 8.00 | ▇▁▁▁▁ |
| Parch | 0 | 1.0 | 0.38 | 0.81 | 0.00 | 0.00 | 0.00 | 0.0 | 6.00 | ▇▁▁▁▁ |
| Fare | 0 | 1.0 | 32.20 | 49.69 | 0.00 | 7.91 | 14.45 | 31.0 | 512.33 | ▇▁▁▁▁ |
3.2.1 Automated EDA with DataExplorer
The DataExplorer package can generate a comprehensive HTML report, providing a deep dive into the data with a single command.
Code
# This command generates a detailed EDA report.
# Note: create_report() comes from the DataExplorer package, which must be
# installed and loaded separately: library(DataExplorer)
create_report(titanic_df, y = "Survived")
3.2.2 Visualizing Missing Data and Key Relationships
Code
# Visualize missing data to understand its extent
gg_miss_var(titanic_df)
Code
# Explore survival rate by passenger class
titanic_df %>%
  mutate(Survived = as.factor(Survived)) %>%
  ggplot(aes(x = Pclass, fill = Survived)) +
  geom_bar(position = "fill") +
  labs(y = "Proportion", title = "Survival Rate by Passenger Class")
Code
# Explore survival rate by sex
titanic_df %>%
  mutate(Survived = as.factor(Survived)) %>%
  ggplot(aes(x = Sex, fill = Survived)) +
  geom_bar(position = "fill") +
  labs(y = "Proportion", title = "Survival Rate by Sex")
Code
# Explore age distribution by survival
titanic_df %>%
  mutate(Survived = as.factor(Survived)) %>%
  ggplot(aes(x = Age, fill = Survived)) +
  geom_histogram(binwidth = 5, position = "identity", alpha = 0.7) +
  labs(title = "Age Distribution by Survival")
3.3 Step 5: Data Cleaning and Feature Engineering
We perform some initial data cleaning, such as converting variables to the correct type. More complex preprocessing will be handled by the recipe.
Code
titanic_clean <- titanic_df %>%
  mutate(
    Survived = as.factor(Survived),
    Pclass = as.factor(Pclass)
  ) %>%
  select(Survived, Pclass, Sex, Age, SibSp, Parch, Fare, Embarked,
         PassengerId, Name, Ticket)

glimpse(titanic_clean)
3.3.1 Variable Groups
Target Variable: Survived
Numerical Features: Age, SibSp, Parch, Fare
Categorical Features: Pclass, Sex, Embarked
Identifier Variables: PassengerId, Name, Ticket (to be excluded from modeling)
3.4 Step 6: Data Splitting
To properly evaluate our models, we split the data into three sets: a training set (for model building), a testing set (for tuning and initial evaluation), and a final hold-out set (for unbiased final validation).
Code
set.seed(123)  # for reproducibility

# First, split into training/testing (90%) and hold-out (10%)
split_ho <- initial_split(titanic_clean, prop = 0.9, strata = Survived)
train_test_data <- training(split_ho)
holdout_data <- testing(split_ho)

# Next, split the 90% into training (7/9) and testing (2/9)
# This results in overall proportions of 70% train, 20% test
split_tt <- initial_split(train_test_data, prop = 7/9, strata = Survived)
train_data <- training(split_tt)
test_data <- testing(split_tt)

# Create cross-validation folds from the training data for robust evaluation
folds <- vfold_cv(train_data, v = 5, strata = Survived)

# Check the proportions of each dataset
cat("Training data:", nrow(train_data), "rows")
Training data: 622 rows
Code
cat("Testing data:", nrow(test_data), "rows")
Testing data: 179 rows
Code
cat("Hold-out data:", nrow(holdout_data), "rows")
Hold-out data: 90 rows
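Because every split above was stratified on Survived, the survival rate should be roughly the same in all three sets. Here is a quick, optional sanity check, sketched under the assumption that the split objects exist as created above:

# Survival rate in each split; stratified sampling should keep these close
list(train = train_data, test = test_data, holdout = holdout_data) %>%
  purrr::map_dbl(~ mean(.x$Survived == "1"))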
4 Modeling
4.1 Step 7: Create a Preprocessing Recipe
A recipe is a powerful tool from tidymodels for defining a sequence of data preprocessing steps. This ensures that the same transformations are applied consistently to all data splits.
Our recipe will:
- Impute missing Age values using K-Nearest Neighbors.
- Impute missing Embarked values using the mode.
- Create dummy variables for all categorical predictors.
- Normalize all numeric predictors to have a mean of 0 and standard deviation of 1.
- Assign identifier roles to columns that should not be used as predictors.
Code
titanic_recipe <- recipe(Survived ~ ., data = train_data) %>%
  update_role(c(PassengerId, Name, Ticket), new_role = "sample ID") %>%
  step_impute_knn(Age, neighbors = 5) %>%
  step_impute_mode(Embarked) %>%
  step_dummy(all_nominal_predictors()) %>%
  step_normalize(all_numeric_predictors())

# Check the recipe to see the transformations
summary(prep(titanic_recipe))
# A tibble: 13 × 4
variable type role source
<chr> <list> <chr> <chr>
1 Age <chr [2]> predictor original
2 SibSp <chr [2]> predictor original
3 Parch <chr [2]> predictor original
4 Fare <chr [2]> predictor original
5 PassengerId <chr [2]> sample ID original
6 Name <chr [3]> sample ID original
7 Ticket <chr [3]> sample ID original
8 Survived <chr [3]> outcome original
9 Pclass_X2 <chr [2]> predictor derived
10 Pclass_X3 <chr [2]> predictor derived
11 Sex_male <chr [2]> predictor derived
12 Embarked_Q <chr [2]> predictor derived
13 Embarked_S <chr [2]> predictor derived
Code
# Bake the recipe to see the transformed data
titanic_recipe %>%
  prep() %>%
  bake(new_data = test_data) %>%
  head() %>%
  glimpse()
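As an optional check that the imputation steps did their job, the prepped training data should contain no missing Age or Embarked values. A small sketch, assuming the recipe defined above:

# Count remaining NA values per column after the recipe is applied
titanic_recipe %>%
  prep() %>%
  bake(new_data = NULL) %>%   # new_data = NULL returns the prepped training set
  summarise(across(everything(), ~ sum(is.na(.x))))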
4.2 Step 8: Define Models
We will define a suite of classification models and compare their performance on this dataset.
4.2.1 Model Introductions
Logistic Regression: A linear model that predicts the probability of a binary outcome. It’s simple, interpretable, and a great baseline.
K-Nearest Neighbors (KNN): A non-parametric algorithm that classifies a data point based on the majority class of its ‘k’ nearest neighbors.
Support Vector Machines (SVM): A model that finds an optimal hyperplane to separate classes in the feature space. We’ll use a radial basis function kernel.
Decision Trees: A tree-like model of decisions. Simple to understand and visualize, but can be prone to overfitting.
Random Forest: An ensemble of many decision trees. It improves upon single trees by averaging their results, reducing overfitting and increasing accuracy. We will tune its hyperparameters.
XGBoost: An advanced and highly efficient implementation of gradient boosting. It’s known for its high performance and is a popular choice in competitions. We will tune its hyperparameters.
Neural Networks: A model inspired by the human brain, consisting of interconnected nodes (neurons) in layers. It can capture complex non-linear relationships.
Naive Bayes: A probabilistic classifier based on Bayes’ theorem with a ‘naive’ assumption of conditional independence between features. It is very fast and works well with high-dimensional data.
Regularized Logistic Regression: An extension of logistic regression that includes a penalty term (L1/Lasso or L2/Ridge) to prevent overfitting by shrinking coefficient estimates. We will tune the penalty and mixture parameters.
4.2.2 Model Specifications
Code
# Logistic Regression
log_reg_spec <- logistic_reg() %>%
  set_engine("glm") %>%
  set_mode("classification")

# K-Nearest Neighbors
knn_spec <- nearest_neighbor() %>%
  set_engine("kknn") %>%
  set_mode("classification")

# Support Vector Machine
svm_spec <- svm_rbf() %>%
  set_engine("kernlab") %>%
  set_mode("classification")

# Decision Tree
tree_spec <- decision_tree() %>%
  set_engine("rpart") %>%
  set_mode("classification")

# Random Forest (Tunable)
rf_spec <- rand_forest(mtry = tune(), trees = tune(), min_n = tune()) %>%
  set_engine("ranger") %>%
  set_mode("classification")

# XGBoost (Tunable)
xgb_spec <- boost_tree(trees = tune(), tree_depth = tune(), min_n = tune(),
                       learn_rate = tune(), loss_reduction = tune()) %>%
  set_engine("xgboost") %>%
  set_mode("classification")

# Neural Network (Tunable)
nn_spec <- mlp(hidden_units = tune(), penalty = tune(), epochs = 100) %>%
  set_engine("nnet") %>%
  set_mode("classification")

# Naive Bayes
nb_spec <- naive_Bayes() %>%
  set_engine("naivebayes") %>%
  set_mode("classification")

# Regularized Logistic Regression (Tunable)
glmnet_spec <- logistic_reg(penalty = tune(), mixture = tune()) %>%
  set_engine("glmnet") %>%
  set_mode("classification")
4.2.3 Model Comparison Table

| Model | Key Strengths | Key Weaknesses | Tunable |
|---|---|---|---|
| Logistic Regression | Interpretable, fast, good baseline | Assumes linearity | No |
| K-Nearest Neighbors | Simple, non-parametric | Computationally intensive, needs scaling | No |
| SVM | Effective in high dimensions, robust | Can be slow, less interpretable | No |
| Decision Tree | Easy to interpret and visualize | Prone to overfitting | No |
| Random Forest | High accuracy, robust to overfitting | Less interpretable, can be slow | Yes |
| XGBoost | Top-tier performance, fast, flexible | Complex, many parameters to tune | Yes |
| Neural Network | Captures complex patterns, highly flexible | Requires large data, prone to overfitting | Yes |
| Naive Bayes | Very fast, works well on wide data | Strong feature independence assumption | No |
| Regularized Logistic Regression | Prevents overfitting, feature selection (Lasso) | Less interpretable than simple LR | Yes |

4.3 Step 9: Create a WorkflowSet
A workflow_set is a powerful and convenient way to manage multiple workflows (recipe + model combinations) simultaneously, making it easy to train and compare them.
Code
all_models <- list(
  log_reg = log_reg_spec,
  knn = knn_spec,
  svm = svm_spec,
  tree = tree_spec,
  rf = rf_spec,
  xgb = xgb_spec,
  nn = nn_spec,
  nb = nb_spec,
  glmnet = glmnet_spec
)

all_workflows <- workflow_set(
  preproc = list(recipe = titanic_recipe),
  models = all_models
)
4.4 Step 10: Training and Hyperparameter Tuning
We use workflow_map() to efficiently fit the non-tunable models and tune the hyperparameters of the tunable ones using grid search. We’ll leverage parallel processing to speed up this step.
Code
# Set up parallel processing to speed up computation
plan(multisession)

# Define the metrics we want to track
class_metrics <- metric_set(accuracy, roc_auc, sensitivity, precision)

# Tune the workflows using the cross-validation folds
set.seed(234)
tune_results <- workflow_map(
  all_workflows,
  "tune_grid",
  resamples = folds,
  grid = 15,                  # Use a grid of 15 candidate parameter sets for tuning
  metrics = class_metrics,
  control = control_grid(save_pred = TRUE)  # Save predictions for later analysis
)
5 Model Evaluation
5.1 Step 11: Compare Model Performance
Let’s visualize and compare the performance of all our trained models.
Code
# Plot the results to visually compare model performance across metrics
autoplot(tune_results)
5.1.1 Understanding the Metrics
Before diving into the results, it’s crucial to understand what each metric signifies. For binary classification, we often use a confusion matrix to evaluate a model’s performance.
| | Predicted: Survived | Predicted: Perished |
|---|---|---|
| Actual: Survived | True Positive (TP) | False Negative (FN) |
| Actual: Perished | False Positive (FP) | True Negative (TN) |
AUC - ROC (Area Under the Receiver Operating Characteristic Curve): This is a key metric for classification. It measures the model’s ability to distinguish between classes. An AUC of 1.0 is a perfect model, while 0.5 is no better than random chance.
Accuracy: The proportion of correct predictions. It can be misleading if classes are imbalanced. (TP + TN) / (TP + TN + FP + FN)
Precision: Of all passengers predicted to survive, how many actually did? TP / (TP + FP)
Recall (Sensitivity): Of all passengers who actually survived, how many did our model correctly identify? TP / (TP + FN). A short worked example with these formulas follows below.
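To make these formulas concrete, here is a small worked example in base R using made-up confusion-matrix counts (illustrative only, not taken from our models):

# Hypothetical confusion-matrix counts, for illustration only
tp <- 50; fn <- 12; fp <- 10; tn <- 107

accuracy  <- (tp + tn) / (tp + tn + fp + fn)  # proportion of correct predictions
precision <- tp / (tp + fp)                   # predicted survivors who actually survived
recall    <- tp / (tp + fn)                   # actual survivors the model identified

c(accuracy = accuracy, precision = precision, recall = recall)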
Code
# Rank models by accuracy and display in an interactive table
rank_results(tune_results, rank_metric = "accuracy") %>%
  filter(.metric == "accuracy") %>%
  select(-n, -std_err, -preprocessor) %>%
  reactable(
    defaultPageSize = 10,
    striped = TRUE,
    highlight = TRUE,
    compact = TRUE,
    columns = list(
      wflow_id = colDef(name = "Workflow"),
      .metric = colDef(name = "Metric"),
      mean = colDef(name = "Mean", format = colFormat(digits = 4)),
      rank = colDef(name = "Rank")
    ),
    defaultSorted = list(rank = "asc")
  )
Code
# Collect all metrics for further analysis
all_metrics <- collect_metrics(tune_results)
5.1.2 ROC Curves
We can plot the ROC curves for each model to visualize the trade-off between the true positive rate and the false positive rate.
Code
tune_results %>%
  collect_predictions() %>%
  group_by(wflow_id) %>%
  roc_curve(truth = Survived, .pred_0) %>%
  autoplot()
6 Finalizing and Deploying the Model
6.1 Step 12: Select and Save the Best Model
Based on our evaluation, the random forest workflow achieved the highest accuracy. We will now select this workflow, extract its optimal hyperparameters, and fit it to the combined training and testing data.
Code
# Select the best workflow based on accuracy
best_wflow_id <- rank_results(tune_results, rank_metric = "accuracy") %>%
  dplyr::slice(1) %>%
  pull(wflow_id)

best_workflow <- extract_workflow(tune_results, id = best_wflow_id)

# Find the best hyperparameters for the top model
best_result <- extract_workflow_set_result(tune_results, id = best_wflow_id)
best_params <- select_best(best_result, metric = "accuracy")
best_params
# A tibble: 1 × 4
mtry trees min_n .config
<int> <int> <int> <chr>
1 2 1571 10 Preprocessor1_Model03
Code
# Finalize the workflow with the best parameters
final_workflow <- finalize_workflow(best_workflow, best_params)
final_workflow
══ Workflow ════════════════════════════════════════════════════════════════════
Preprocessor: Recipe
Model: rand_forest()
── Preprocessor ────────────────────────────────────────────────────────────────
4 Recipe Steps
• step_impute_knn()
• step_impute_mode()
• step_dummy()
• step_normalize()
── Model ───────────────────────────────────────────────────────────────────────
Random Forest Model Specification (classification)
Main Arguments:
mtry = 2
trees = 1571
min_n = 10
Computational engine: ranger
Code
# Fit the final model on the combined training and testing data
final_fit <- fit(final_workflow, data = train_test_data)

# Save the final model object for future use
saveRDS(final_fit, file = here("../data/best_titanic_model.rds"))
6.2 Step 13: Load the Saved Model
We can now load the saved model object for deployment or future predictions.
Code
loaded_model <- readRDS(here("../data/best_titanic_model.rds"))
print(loaded_model)
══ Workflow [trained] ══════════════════════════════════════════════════════════
Preprocessor: Recipe
Model: rand_forest()
── Preprocessor ────────────────────────────────────────────────────────────────
4 Recipe Steps
• step_impute_knn()
• step_impute_mode()
• step_dummy()
• step_normalize()
── Model ───────────────────────────────────────────────────────────────────────
Ranger result
Call:
ranger::ranger(x = maybe_data_frame(x), y = y, mtry = min_cols(~2L, x), num.trees = ~1571L, min.node.size = min_rows(~10L, x), num.threads = 1, verbose = FALSE, seed = sample.int(10^5, 1), probability = TRUE)
Type: Probability estimation
Number of trees: 1571
Sample size: 801
Number of independent variables: 9
Mtry: 2
Target node size: 10
Variable importance mode: none
Splitrule: gini
OOB prediction error (Brier s.): 0.1290632
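To illustrate what deployment might look like, the sketch below scores a single hypothetical passenger with the loaded workflow. The passenger values are invented for illustration; the columns and Pclass factor levels are assumed to match titanic_clean, and the ID columns are supplied because the recipe may still expect them when baking new data.

# A hypothetical new passenger (all values invented for illustration)
new_passenger <- tibble(
  Pclass      = factor("3", levels = c("1", "2", "3")),
  Sex         = "male",
  Age         = 28,
  SibSp       = 0,
  Parch       = 0,
  Fare        = 7.9,
  Embarked    = "S",
  PassengerId = 9999,             # ID columns kept because the recipe assigns them
  Name        = "Doe, Mr. John",  # a "sample ID" role and may require them at bake time
  Ticket      = "000000"
)

# Class prediction plus survival probabilities
predict(loaded_model, new_data = new_passenger) %>%
  bind_cols(predict(loaded_model, new_data = new_passenger, type = "prob"))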
6.3 Step 14: Make Predictions on Hold-out Data
Finally, we evaluate our best model on the hold-out data, which it has never seen before. This provides the most realistic estimate of its performance on new, unseen data.
Code
# Make predictions on the hold-out set
holdout_predictions <- predict(loaded_model, new_data = holdout_data) %>%
  bind_cols(predict(loaded_model, new_data = holdout_data, type = "prob")) %>%
  bind_cols(holdout_data %>% select(Survived))

# Calculate performance metrics on the hold-out data
metrics <- metric_set(accuracy, precision, recall, f_meas, roc_auc)
final_performance <- metrics(holdout_predictions, truth = Survived,
                             estimate = .pred_class, .pred_0)

# Display the final performance in a clean table
final_performance %>%
  kbl(caption = "Final Model Performance on Hold-out Data") %>%
  kable_styling(bootstrap_options = "striped", full_width = FALSE)
Final Model Performance on Hold-out Data

| .metric | .estimator | .estimate |
|---|---|---|
| accuracy | binary | 0.8000000 |
| precision | binary | 0.7846154 |
| recall | binary | 0.9272727 |
| f_meas | binary | 0.8500000 |
| roc_auc | binary | 0.8576623 |
Code
# Visualize the confusion matrix
conf_mat(holdout_predictions, truth = Survived, estimate = .pred_class) %>%
  autoplot(type = "heatmap")
6.4 Step 15: Decile Analysis
To better understand how well our models discriminate between classes, we perform a decile analysis. We sort predictions by the probability of survival, group them into 10 deciles, and then calculate the actual survival rate within each decile. This helps to visualize if the models are effectively identifying high-risk and low-risk groups.
Code
# Get out-of-fold predictions from all models (cross-validation resamples)
all_predictions <- tune_results %>% collect_predictions()

# Calculate the overall average survival rate
overall_avg_survival <- mean(as.numeric(as.character(all_predictions$Survived)))

# Calculate deciles and survival rates for each model
decile_data <- all_predictions %>%
  mutate(Survived_numeric = as.numeric(as.character(Survived))) %>%
  group_by(wflow_id) %>%
  mutate(decile = ntile(desc(.pred_1), 10)) %>%
  group_by(wflow_id, decile) %>%
  summarise(
    avg_pred_prob = mean(.pred_1),
    actual_survival_rate = mean(Survived_numeric),
    n = n(),
    .groups = "drop"
  ) %>%
  mutate(wflow_id = str_remove(wflow_id, "recipe_"))

# Visualize the decile analysis
ggplot(decile_data, aes(x = as.factor(decile), y = actual_survival_rate)) +
  geom_col(aes(fill = wflow_id), show.legend = FALSE) +
  geom_hline(aes(yintercept = overall_avg_survival, linetype = "Overall Average"),
             color = "gray50") +
  scale_y_continuous(labels = scales::percent) +
  facet_wrap(~wflow_id, ncol = 3) +
  labs(
    title = "Decile Analysis: Actual Survival Rate by Predicted Decile",
    subtitle = "Decile 1 contains the 10% of predictions with the highest probability of survival.",
    x = "Decile (Ranked by Predicted Probability of Surviving)",
    y = "Actual Survival Rate",
    linetype = ""
  ) +
  theme_minimal() +
  theme(legend.position = "bottom")
7 Conclusion
In this analysis, we successfully built and evaluated multiple machine learning models to predict Titanic survivors. The tidymodels framework provided a powerful and organized workflow for this task. Our final random forest model demonstrated strong performance on the hold-out data (about 80% accuracy and an ROC AUC of roughly 0.86), indicating its effectiveness in this classification problem. This project serves as a template for tackling similar binary classification challenges in a structured and reproducible manner.